This session is a very brief introduction to R and RStudio for beginners, with reference to Civil Service People Survey (CSPS) data.
There is a lot of extra material that could be covered, but is left out for brevity.
We’ll be developing this guidance and making it freely available on the web.
The annual CSPS produces a lot of data each year. Departments are provided with summary reports, but can access response-level data (‘microdata’) to perform their own in-depth analyses.
Many tools like Excel, SPSS and Stata are used across government to analyse the microdata. Many of these tools are proprietary and require expensive licenses. This variety can make it tricky for analysts to share approaches between departments and even within them.
Activity
Of course, every analyst and every department is welcome to use the tools that are available to them, that they understand and that get the job done.
Having said this, we advocate specifically the statistical programming language R and the RStudio code editor.
Why R? It’s free, it’s open source and it’s widely used for statistical analysis, which makes it easier to share and reproduce approaches within and between departments.
RStudio is a popular and well-supported piece of software for editing and running R code for both beginners and advanced users. It’s also free of charge and the company behind it is a public benefit corporation with a commitment to producing open source software.
The CSPS team are developing some R-based tools for analysing CSPS data specifically. You will be able to download a package that contains common functions for analysing CSPS data. This will help provide consistency for analysis and reporting and make tasks easier to perform and more reproducible. The tools will be shared in the open for anyone to use and so that anyone can help to improve them.
Before starting, you should download both R and RStudio.
Both are free, but you might need to get in touch with your IT team to get them installed on your computer.
Open RStudio – its icon is a white letter ‘R’ in a blue circle: <img src="img/rstudio-icon.png" width="5%">
When you open RStudio for the first time, you’ll see the window is split into three ‘panes’, which are numbered below:
Your window may not look exactly like this one, depending on your operating system.
Labelled in the image are: the console (the large pane on the left), the environment pane (upper right), and the files, plots, packages and help pane (lower right).
We don’t need to concern ourselves with every button and tab for now.
There are many benefits to having one folder per analytical project. It means your work is more:
For example, you can refer to data/dataset.csv rather than file/path/on/my/personal/machine/that/you/cannot/access.csv.

RStudio has a system that helps you set this up. You can create an ‘RStudio Project’ like this:

1. Click File > New Project
2. Choose ‘New Directory’, then ‘New Project’
3. Give the directory a name (we’ll use csps-r for this session) and choose where to save it
4. Click ‘Create Project’

This creates a folder where you specified that contains an RStudio Project file (extension ‘.Rproj’). This folder is the ‘home’ of your project and this is where you should house all the files and code that you need.
There are many ways to organise this folder. For this session we’ll keep things simple:
csps-r/ # the project folder
├── csps-r.Rproj # R Project file
├── data/ # read-only raw data
├── output/ # processed data
└── training.R # R script files
Create two folders – data and output – in your Project folder (we’ll be using these later).

To access your RStudio Project in future, navigate to the project folder and double-click your R Project file, which has the .Rproj extension (e.g. your-project.Rproj). It will open in the same state that you left it when you last closed it.
You’ll write your code into a special text file called an R script, which has the extension .R.
Having opened the R Project (.Rproj) file for your analysis, open a new script by clicking File > New File > R script. A new blank script will appear in a new pane in the upper left of the RStudio window.
You can type or copy and paste code into this document. This serves as a record of the actions you used to analyse the data step-by-step.
Tip
How do you actually run some R code? Let’s start with a small calculation.
First, we’ll add two numbers together. Type the calculation 1 + 1 into your script:
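In your script, this is a single line:

```r
1 + 1  # add two numbers together
```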
To execute it, make sure your cursor is on the same line as the code and press Command+Enter on a Mac or Control+Enter on a PC (there’s also a ‘Run’ button in the upper right of the script pane). You can also run several lines of code at once by highlighting them and then running the selection.
What happened when you ran the code? The following was printed to the console in the lower-left pane of RStudio:
[1] 2
Great, we got the answer 2, as expected. (The number in square brackets relates to the number of items returned in the answer and doesn’t concern us right now.)
Tip
Remember to save your script regularly: click File > Save or use the Control+S or Command+S shortcuts.

This is good, but ideally we want to store objects (values, tables, plots, etc), so we can refer to them in other pieces of code later.
You do this in R with a special operator: the ‘assignment arrow’, which is written as <-. The shortcut for it is Alt+- (hyphen).
For example, we can assign 1 + 1 to the name my_num with <-. Execute the following code:
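The assignment looks like this:

```r
my_num <- 1 + 1  # store the result of 1 + 1 under the name 'my_num'
```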
Hm. Nothing printed out in the console. Instead the object is now stored in your ‘environment’ – see the top right pane in RStudio:
You can now refer to this object by name in your script. For example, you can print it:
[1] 2
Tip
Typing my_num on its own is equivalent to print(my_num); we’ll use print() throughout to be more explicit.

The real benefit of this is that you don’t have to repeat yourself every time you want to use that particular calculation. For example, you can refer to it in new expressions:
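For instance, any expression can use the stored name; multiplying it by 5 gives the result below:

```r
my_num * 5  # my_num holds 2, so this multiplies 2 by 5
```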
[1] 10
Tip
Give your objects short, descriptive names, like var_mean and var_median.

Activity
1. Create an object called val1 that stores the value 543
2. Create an object called val2 that stores the value 612
3. Create an object called calc that is the multiplication (*) of val1 and val2
4. What is the value of calc?

Time limit: 01:00
We stored a numeric value in the last section. We can do more than just store one item of data at a time though.
This next chunk of code combines multiple elements with the c() command. This kind of multi-element object is called a ‘vector’.
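For example (the object name my_nums here is just illustrative):

```r
my_nums <- c(1, 2, 3)  # combine three numbers into a vector
print(my_nums)         # prints: [1] 1 2 3
```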
Here’s a vector that contains text rather than numbers. You put character strings inside quotation marks ("") to identify them as such. Numbers don’t need to be in quotation marks (unless you want them to be stored as text, specifically).
# Create an example vector
dept_names <- c("DfE", "DHSC", "DfT") # combine some values
print(dept_names) # have a look at what the object contains
[1] "DfE"  "DHSC" "DfT"
You can see what ‘class’ your object is at any time with the class() function.
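For the objects we created earlier (redefined here so the chunk is self-contained):

```r
my_num <- 1 + 1                        # the single numeric value from earlier
dept_names <- c("DfE", "DHSC", "DfT")  # the character vector from earlier
class(my_num)      # check the class of the single value
class(dept_names)  # check the class of the vector
```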
[1] "numeric"
[1] "character"
Tip
A shortcut for c(1, 2, 3) is 1:3.

So we’ve created objects composed of a single value (my_num) and a vector of values (dept_names).
The next step would be to combine a number of vectors together to create a table with rows and columns. Tables of data with rows and columns are called ‘data frames’ in R and are effectively a bunch of vectors of the same length stuck together.
Here’s an example of a data frame built from scratch:
# Create a data frame of selected departments
dept_info <- data.frame(
dept = dept_names, # use vector from earlier
headcount = c(6900, 8300, 15000),
responsibility = c("Education", "Health", "Transport")
)
print(dept_info) # see the data frame
  dept headcount responsibility
1 DfE 6900 Education
2 DHSC 8300 Health
3 DfT 15000 Transport
Can you see how the data frame is three vectors (dept, headcount and responsibility) of the same length (3 values) arranged into columns? The function data.frame() bound these together into a table format. Let’s check the class:
[1] "data.frame"
R is capable of building very complex objects, but tabular data with rows and columns is ubiquitous and it’s how the CSPS data is stored. We’ll be focusing on data frames for now.
You’ve been using functions already: print(), c(), data.frame(), class().
A function is a reproducible unit of code that performs a given task, such as reading a data file or fitting a model. There are many of these built into R already, but you can also download ‘packages’ of functions, and you can create your own.
Functions are written as the function name followed by brackets. The brackets contain the ‘arguments’, which are like the settings for the function. One argument might be a filepath to some data, another might describe the colour of points to be plotted. They’re separated by commas.
So a generic function might look like this:
# This isn't a real function; don't run it
function_name(
data = my_data,
colour = "red",
option = 5
)

Note that you can break the function over several lines to improve readability and so you can comment on individual arguments. You can put your cursor on any of these lines and run it. You don’t have to highlight the whole thing.
You can type a question mark followed by a function name to learn about its arguments. This will appear in a help file in the bottom right pane. For example, ?plot().
Tip
# Define a function to add two numbers
add_nums <- function(val_a, val_b) {
val_a + val_b
}
add_nums(val_a = 3, val_b = 4) # use the function
[1] 7
Functions can be bundled into packages. A bunch of packages are pre-installed with R, but there are thousands more available for download. These packages extend the basic capabilities of R or improve them.
Packages can be installed to your computer using the install.packages() function. This automatically fetches and downloads packages from a centralised package database on the internet called CRAN. CRAN accepts packages that meet strict quality criteria, so you can be assured that what you download is of an appropriate standard.
Tip
We’re going to use a few packages to help us, including {dplyr} for manipulating data, {haven} for reading Stata files, {tidyr} for reshaping data and {ggplot2} for plotting.
Tip
All of these packages are part of the ‘tidyverse’ collection, which can be fetched in one go with install.packages("tidyverse").

So we can install all these packages and more at once with:
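In a single line:

```r
install.packages("tidyverse")  # downloads the tidyverse collection from CRAN
```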
You only need to run the installation function once per package. At that point it’s downloaded to your machine, so you don’t need to install it again.
You use library("package_name") to tell R to make available the functions from a package in your script. You will need to run this once at the start of your session to use that package’s functions.
So now we have the tidyverse packages installed we can call them with the library() function so we can use them.
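Assuming the packages mentioned above, the library() calls might look like:

```r
library(dplyr)    # data manipulation verbs
library(haven)    # reading Stata (.dta) files
library(tidyr)    # reshaping data
library(ggplot2)  # plotting
```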
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Sometimes a message will be printed to tell you a bit more about the package, which is what happens for {dplyr}.
We can start using functions from these packages now that they’re loaded.
It’s good practice to write the library() lines near the top of your script file so that others know which packages are being used in the script.
We aren’t using real CSPS data for these exercises. Instead, we’ll be using a ‘synthetic’ version that mimics the 2019 data.
In short, this means that the data distributions within the variables are preserved, but no response represents a real individual, so we can produce realistic-looking outputs without compromising anyone’s anonymity.
We’ve also restricted the number of variables (columns) and rows (responses) to keep the data set relatively small, and have added a fake unique ID value.
Ordinarily we would send you the data for your organisation on request. For this session, we’ve prepared the synthetic data set as a Stata-format (.dta) file.
You can download it to your machine with the download.file() function. Save it to the data/ folder of your project using the destfile argument.
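A sketch of that call, with a placeholder URL and file name (substitute the link you were given for the session):

```r
download.file(
  url = "https://example.com/csps_synthetic.dta",  # placeholder URL
  destfile = "data/csps_synthetic.dta"             # save into the data/ folder
)
```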
The variables in the synthetic data set are:
The {haven} package has a function called read_stata() that you can pass the file path to. Let’s read in the data with this function and name it ‘data’.
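With an illustrative file path (match it to wherever you saved the file):

```r
data <- haven::read_stata("data/csps_synthetic.dta")  # path is illustrative
```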
This will read the data in as a ‘tibble’, a fancier type of data frame that’s used by the tidyverse packages. For example, when printed to the console, tibbles use colour coding and are truncated to fit.
Activity
It’s good to preview the data and check it looks like what we expected.
The {dplyr} package that we loaded earlier has a function called glimpse(), which tells you about the structure of the data.
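The call is simply:

```r
glimpse(data)  # prints one summary line per column of the data
```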
Observations: 11,555
Variables: 38
$ ResponseID <dbl> 100000, 100001, 100002, 100003, 100004, 100005, 10000…
$ OverallDeptCode <chr> "ORGA", "ORGA", "ORGA", "ORGA", "ORGA", "ORGA", "ORGA…
$ B01 <dbl+lbl> 4, 4, 3, 5, 5, 5, 4, 4, 4, 5, 4, 4, 5, 4, 4, 4, 3…
$ B02 <dbl+lbl> 4, 4, 4, 5, 5, 5, 4, 3, 4, 4, 4, 4, 4, 4, 4, 4, 3…
$ B03 <dbl+lbl> 3, 4, 3, 5, 5, 5, 4, 3, 3, 4, 2, 4, 2, 4, 4, 2, 2…
$ B04 <dbl+lbl> 3, 4, 4, 5, 4, 5, 4, 2, 4, 4, 1, 4, 4, 4, 3, 1, 2…
$ B05 <dbl+lbl> 4, 4, 4, 5, 4, 5, 3, 4, 5, 5, 4, 3, 5, 3, 3, 4, 3…
$ B47 <dbl+lbl> 4, 3, 4, 4, 5, 5, 4, 3, 3, 4, 2, 4, 3, 4, 4, 4, 3…
$ B48 <dbl+lbl> 4, 3, 4, 5, 5, 5, 4, 4, 3, 4, 2, 4, 5, 4, 4, 4, 3…
$ B49 <dbl+lbl> 4, 2, 4, 4, 5, 5, 4, 3, 2, 2, 2, 2, 2, 4, 2, 3, 2…
$ B50 <dbl+lbl> 4, 2, 4, 4, 5, 5, 4, 4, 2, 3, 3, 2, 4, 4, 4, 3, 2…
$ B51 <dbl+lbl> 4, 2, 4, 4, 5, 5, 4, 4, 2, 3, 3, 2, 4, 4, 2, 3, 3…
$ E03 <dbl+lbl> 1, 4, 4, 4, 4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4, 1, 4…
$ E03_GRP <dbl+lbl> 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2…
$ E03A_01 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA…
$ E03A_02 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, 1, NA, NA, NA, NA…
$ E03A_03 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ E03A_04 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ E03A_05 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ E03A_06 <dbl+lbl> 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ E03A_07 <dbl+lbl> 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ E03A_08 <dbl+lbl> 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ E03A_09 <dbl+lbl> 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ E03A_10 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ E03A_11 <dbl+lbl> 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ E03A_12 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ E03A_13 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ E03A_14 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ E03A_15 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ E03A_16 <dbl+lbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ W01 <dbl> 7, 10, 9, 10, 10, 10, 7, 9, 8, 10, 8, 5, 8, 9, 3, 8, …
$ W02 <dbl> 7, 10, 9, 10, 10, 10, 7, 10, 7, 10, 10, 5, 8, 11, 3, …
$ W03 <dbl> 8, 10, NA, 10, 11, 10, 7, 8, 6, 9, 9, 4, 9, 3, 11, 8,…
$ W04 <dbl> 7, 1, 6, 5, 2, 4, 1, 8, 6, 5, 2, 9, 8, 1, 8, 4, 1, 4,…
$ J03 <dbl+lbl> 18, 1, 1, 1, 1, 1, 1, 1, NA, NA, 1, 1, 1, 1, 1, 1…
$ Z02 <dbl+lbl> 2, 3, 2, 4, 3, 4, 2, 2, NA, 4, 4, 3, 4, 5, 3, 2, …
$ ees <dbl> 0.75, 0.35, 0.75, 0.80, 1.00, 1.00, 0.75, 0.65, 0.35,…
$ mw_p <dbl> 0.6, 1.0, 0.6, 1.0, 1.0, 1.0, 0.8, 0.4, 0.8, 1.0, 0.6…
The top of the output tells us there’s 11,555 observations (rows) and 38 variables (columns).
Column names are then listed with the data type and the first few examples. For example, ‘OverallDeptCode’ contains character class (<chr>) data in the form of strings. Column names starting with ‘B’, ‘E’, ‘J’ and ‘Z’ are question codes and they contain responses expressed in numeric form, so they’re of class ‘double’ (<dbl>).
The numbers encode certain responses. For example, 1 means ‘strongly disagree’ and 5 means ‘strongly agree’ for the ‘B’ series of questions.
How do we know what all the numeric values mean? You’ll see that a number of the columns have the label class (<lbl>) too. This means that the column carries additional ‘attributes’ that give the corresponding labels for the values.
Labels aren’t used that frequently in R data frames, but are used in programs like Stata and SPSS. Since we’ve read in a Stata file, we’ve got these labels available to us.
You can also see that there are also lots of NA values. R uses NA to mean ‘not available’ – the data are missing. In this case, it means that the respondent didn’t supply an answer for that question.
Another way of expressing this is to print() to the console.
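That is:

```r
print(data)  # tibbles print a truncated preview of the table
```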
# A tibble: 11,555 x 38
ResponseID OverallDeptCode B01 B02 B03 B04 B05 B47
<dbl> <chr> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl+l> <dbl+l>
1 100000 ORGA 4 [Agr… 4 [Agr… 3 [Nei… 3 [Nei… 4 [Agr… 4 [Agr…
2 100001 ORGA 4 [Agr… 4 [Agr… 4 [Agr… 4 [Agr… 4 [Agr… 3 [Nei…
3 100002 ORGA 3 [Nei… 4 [Agr… 3 [Nei… 4 [Agr… 4 [Agr… 4 [Agr…
4 100003 ORGA 5 [Str… 5 [Str… 5 [Str… 5 [Str… 5 [Str… 4 [Agr…
5 100004 ORGA 5 [Str… 5 [Str… 5 [Str… 4 [Agr… 4 [Agr… 5 [Str…
6 100005 ORGA 5 [Str… 5 [Str… 5 [Str… 5 [Str… 5 [Str… 5 [Str…
7 100006 ORGA 4 [Agr… 4 [Agr… 4 [Agr… 4 [Agr… 3 [Nei… 4 [Agr…
8 100007 ORGA 4 [Agr… 3 [Nei… 3 [Nei… 2 [Dis… 4 [Agr… 3 [Nei…
9 100008 ORGA 4 [Agr… 4 [Agr… 3 [Nei… 4 [Agr… 5 [Str… 3 [Nei…
10 100009 ORGA 5 [Str… 4 [Agr… 4 [Agr… 4 [Agr… 5 [Str… 4 [Agr…
# … with 11,545 more rows, and 30 more variables: B48 <dbl+lbl>, B49 <dbl+lbl>,
# B50 <dbl+lbl>, B51 <dbl+lbl>, E03 <dbl+lbl>, E03_GRP <dbl+lbl>,
# E03A_01 <dbl+lbl>, E03A_02 <dbl+lbl>, E03A_03 <dbl+lbl>, E03A_04 <dbl+lbl>,
# E03A_05 <dbl+lbl>, E03A_06 <dbl+lbl>, E03A_07 <dbl+lbl>, E03A_08 <dbl+lbl>,
# E03A_09 <dbl+lbl>, E03A_10 <dbl+lbl>, E03A_11 <dbl+lbl>, E03A_12 <dbl+lbl>,
# E03A_13 <dbl+lbl>, E03A_14 <dbl+lbl>, E03A_15 <dbl+lbl>, E03A_16 <dbl+lbl>,
# W01 <dbl>, W02 <dbl>, W03 <dbl>, W04 <dbl>, J03 <dbl+lbl>, Z02 <dbl+lbl>,
# ees <dbl>, mw_p <dbl>
The output is displayed in table format, but is truncated to fit the console window (this prevents you from printing millions of rows to the console!). You can see the labels are printed alongside the values in this view.
If you want to see the whole dataset you could use the View() function:
This opens up a read-only tab in the script pane that displays your data in full. You can scroll around and order the columns by clicking the headers. This doesn’t affect the underlying data at all.
You can also access this by clicking the little image of a table to the right of the object in the environment pane (upper-right).
We’re going to use a number of functions from the {dplyr} package, which we loaded earlier, to practice some data manipulation.
Functions in the tidyverse suite of packages are usually verbs that describe what they’re doing. We’ll look at select() and filter(), for example.
We won’t have time to go through all of the functions and their variants, but you should get a flavour of what’s possible.
Firstly, we can select() columns of interest. This means we can return a version of the data set composed of a smaller number of columns. This can be helpful for a number of reasons, but one simple one is so that we can focus on specific variables of interest.
The {dplyr} functions take the data frame as their first argument.
Note that we can also rename columns as we select them with the format new_name = old_name. (Alternatively there is a rename() function that only renames columns.)
# Return specific columns
select(
data, # the first argument is the data
Z02, ethnicity = J03 # then the columns to keep
)
# A tibble: 11,555 x 2
Z02 ethnicity
<dbl+lbl> <dbl+lbl>
1 2 [eo] 18 [Any other background]
2 3 [SEO/HEO] 1 [English/Welsh/Scottish/Northern Irish/British]
3 2 [eo] 1 [English/Welsh/Scottish/Northern Irish/British]
4 4 [G6/7] 1 [English/Welsh/Scottish/Northern Irish/British]
5 3 [SEO/HEO] 1 [English/Welsh/Scottish/Northern Irish/British]
6 4 [G6/7] 1 [English/Welsh/Scottish/Northern Irish/British]
7 2 [eo] 1 [English/Welsh/Scottish/Northern Irish/British]
8 2 [eo] 1 [English/Welsh/Scottish/Northern Irish/British]
9 NA NA
10 4 [G6/7] NA
# … with 11,545 more rows
Note that the order you select the columns is the order they’ll appear when they print.
Instead of naming columns to keep, you can specify columns to remove by prefixing the column name with a - (minus).
Tip
Note that our original object (data) remains unchanged, despite us having selected some columns. Writing data <- select(data, B01) would overwrite our original data object.

To save time you can use some special select() helper functions. For example, you can select a column that contains() or starts_with() certain strings. This is useful if you have lots of columns that share a similarity in their names, like the CSPS questions (e.g. B01, B02, etc, all start with “B”).
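The chunk that produces the output below could look like this, selecting the ID column plus every column starting with ‘W’:

```r
select(data, ResponseID, starts_with("W"))
```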
# A tibble: 11,555 x 5
ResponseID W01 W02 W03 W04
<dbl> <dbl> <dbl> <dbl> <dbl>
1 100000 7 7 8 7
2 100001 10 10 10 1
3 100002 9 9 NA 6
4 100003 10 10 10 5
5 100004 10 10 11 2
6 100005 10 10 10 4
7 100006 7 7 7 1
8 100007 9 10 8 8
9 100008 8 7 6 6
10 100009 10 10 9 5
# … with 11,545 more rows
Activity
Use select() to return all the ‘B’ series columns (B01, B02, etc). Time limit: 02:00
Now to filter the rows of the data set based on certain criteria.
We’re going to make use of some logical operators for filtering our data. These return TRUE or FALSE depending on the statement’s validity.
| Symbol | Meaning | Example |
|---|---|---|
| == | Equal to | 5 == 2 + 3 returns TRUE |
| != | Not equal to | 5 != 3 + 3 returns TRUE |
| %in% | Shortcut to match to a vector | 4 %in% c(2, 4, 6) returns TRUE |
| >, < | Greater than, less than | 2 < 3 returns TRUE |
| >=, <= | Equal or greater than, equal or less than | 5 <= 5 returns TRUE |
| & | And (helps string together multiple filters) | 1 < 2 & 5 == 5 returns TRUE |
| \| | Or (helps string together multiple filters) | 1 < 2 \| 5 == 6 returns TRUE (only one of them needs to be true) |
R also has some special shortcut functions for some logical checks. For example:
| Symbol | Meaning | Example |
|---|---|---|
| is.numeric() | Is the content numeric class? | is.numeric(10) returns TRUE |
| is.character() | Is the content character class? | is.character("Downing Street") returns TRUE |
| is.na() | Is the content an NA? | is.na(NA) returns TRUE |
You can negate these functions by preceding them with a !, so is.na(NA) returns TRUE but !is.na(NA) returns FALSE.
Let’s start by creating an object that contains the data filtered for senior civil servants (where variable Z02 equals 5) from two of the organisations.
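A sketch of that filter (the object name data_scs is our choice for illustration):

```r
data_scs <- filter(
  data,                                    # the data frame comes first
  Z02 == 5 &                               # senior civil servants...
    OverallDeptCode %in% c("ORGB", "ORGC") # ...in these two organisations
)
```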
See how there are two filter statements: Z02 == 5 and OverallDeptCode %in% c("ORGB", "ORGC")? We’re asking for both of these things to be true by using the & operator between them.
Notice that we used %in% to match to a vector of department names. These are stored as character strings, so we put them in quotation marks.
Let’s check the columns of interest to see if it worked:
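Assuming the filtered object was named data_scs, we can select just the two columns:

```r
select(data_scs, OverallDeptCode, Z02)
```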
# A tibble: 22 x 2
OverallDeptCode Z02
<chr> <dbl+lbl>
1 ORGB 5 [scs]
2 ORGB 5 [scs]
3 ORGB 5 [scs]
4 ORGB 5 [scs]
5 ORGB 5 [scs]
6 ORGB 5 [scs]
7 ORGB 5 [scs]
8 ORGB 5 [scs]
9 ORGB 5 [scs]
10 ORGB 5 [scs]
# … with 12 more rows
Activity
Use filter() to return senior civil servants in Org A only. Time limit: 02:00
Now to create new columns. The function name is mutate(): we’re ‘mutating’ our data frame by adding a new column where there wasn’t one before. Often you’ll be creating new columns based on the content of columns that already exist, like adding the contents of one to another.
One relevant use of this for the CSPS is to create dummy columns. If certain conditions are met in other columns, we can put a ‘1’ in the dummy column, else we can put ‘0’ if it’s not met.
So we could create a dummy column that flags when a respondent is an SEO/HEO grade:
# Add a column that gets a 1 when the condition is true
data_dummy <- mutate(
data,
dummy = ifelse( # create a new column called 'dummy'
test = Z02 == 3 & J03 %in% 1:4, # test this condition
yes = 1, # if true, put a 1 in the dummy column
no = 0 # otherwise put a 0 in the column
)
)
# See if it worked
select(data_dummy, Z02, J03, dummy)
# A tibble: 11,555 x 3
Z02 J03 dummy
<dbl+lbl> <dbl+lbl> <dbl>
1 2 [eo] 18 [Any other background] 0
2 3 [SEO/HEO] 1 [English/Welsh/Scottish/Northern Irish/British] 1
3 2 [eo] 1 [English/Welsh/Scottish/Northern Irish/British] 0
4 4 [G6/7] 1 [English/Welsh/Scottish/Northern Irish/British] 0
5 3 [SEO/HEO] 1 [English/Welsh/Scottish/Northern Irish/British] 1
6 4 [G6/7] 1 [English/Welsh/Scottish/Northern Irish/British] 0
7 2 [eo] 1 [English/Welsh/Scottish/Northern Irish/British] 0
8 2 [eo] 1 [English/Welsh/Scottish/Northern Irish/British] 0
9 NA NA 0
10 4 [G6/7] NA 0
# … with 11,545 more rows
Activity
Use mutate() to create a dummy column where:
respondents who answered ‘strongly agree’ (5) to both B01 and B02 get a 1, and everyone else gets a 0. Time limit: 02:00
This function is particularly useful for the CSPS data if we want to overwrite our numeric values with their corresponding text labels. Fortunately, the {haven} package that we loaded earlier has a function that replaces the numeric values with their labels: as_factor().
We want to apply this only to the columns that are numeric. There’s a variant of mutate() called mutate_if(), in which you can use a logical test, like we did for filter(), to decide which columns get mutated.
# Convert the numeric columns to their text labels
data_lbl <- mutate_if(
data,
is.numeric, # if the column is numeric
haven::as_factor # then apply the as_factor function
)
data_lbl_chr <- mutate_all(data_lbl, as.character)
glimpse(data_lbl_chr)

Tip
More than one package has a function called as_factor() – how can we resolve this? Be explicit about which package you mean with the package::function() notation, like haven::as_factor().

We can use the join functions to merge two data frames together on a common column.
Let’s create a small trivial data frame that provides a lookup from department codes to full department names and merge it into our CSPS data.
Tip
We’re using the tibble() function from {dplyr} to build the data frame.

lookup <- tibble(
OverallDeptCode = c("ORGA", "ORGB", "ORGC"),
dept_full_name = c("Dept for A", "Ministry of B", "C Agency")
)
print(lookup)
# A tibble: 3 x 2
OverallDeptCode dept_full_name
<chr> <chr>
1 ORGA Dept for A
2 ORGB Ministry of B
3 ORGC C Agency
We want what is perhaps the most common join: left_join(). It gives you all the rows from the ‘left’ data set and merges on the columns from the ‘right’.
Here’s what we’re doing (gif by Garrick Aden-Buie):
To do this, we pass two data frames to arguments x (‘left’) and y (‘right’) and provide the column name to join by.
data_join <- left_join(
x = data, # the original data set ('left')
y = lookup, # the data to merge on ('right')
by = "OverallDeptCode" # the common column between them
)
Warning: Column `OverallDeptCode` has different attributes on LHS and RHS of join
You might get a message saying that the attributes for our joining column aren’t the same. That’s okay; it’s because the column in data (the data set on the ‘LHS’, or ‘left-hand side’, of the join) has attributes, but the one in lookup (on the right-hand side) doesn’t.
Let’s check to see if rows from both data frames are present in the joined data set:
# A tibble: 11,555 x 4
ResponseID B01 OverallDeptCode dept_full_name
<dbl> <dbl+lbl> <chr> <chr>
1 100000 4 [Agree] ORGA Dept for A
2 100001 4 [Agree] ORGA Dept for A
3 100002 3 [Neither agree nor disagree] ORGA Dept for A
4 100003 5 [Strongly agree] ORGA Dept for A
5 100004 5 [Strongly agree] ORGA Dept for A
6 100005 5 [Strongly agree] ORGA Dept for A
7 100006 4 [Agree] ORGA Dept for A
8 100007 4 [Agree] ORGA Dept for A
9 100008 4 [Agree] ORGA Dept for A
10 100009 5 [Strongly agree] ORGA Dept for A
# … with 11,545 more rows
Success: the output has all the rows of the data data frame, plus the new one (dept_full_name) from the lookup data frame.
We’ve seen how to manipulate our data frame a bit. But we’ve been doing it one discrete step at a time, so your script might end up looking something like this:
data_select <- select(data, ResponseID, OverallDeptCode, B01, Z02)
data_filter <- filter(data_select, OverallDeptCode == "ORGA" & Z02 != 5)
data_mutate <- mutate(
data_filter,
positive = ifelse(B01 %in% c(4, 5), "Positive", "Not positive")
)
print(data_mutate)
# A tibble: 1,060 x 5
ResponseID OverallDeptCode B01 Z02 positive
<dbl> <chr> <dbl+lbl> <dbl+lbl> <chr>
1 100000 ORGA 4 [Agree] 2 [eo] Positive
2 100001 ORGA 4 [Agree] 3 [SEO/HE… Positive
3 100002 ORGA 3 [Neither agree nor disag… 2 [eo] Not positi…
4 100003 ORGA 5 [Strongly agree] 4 [G6/7] Positive
5 100004 ORGA 5 [Strongly agree] 3 [SEO/HE… Positive
6 100005 ORGA 5 [Strongly agree] 4 [G6/7] Positive
7 100006 ORGA 4 [Agree] 2 [eo] Positive
8 100007 ORGA 4 [Agree] 2 [eo] Positive
9 100009 ORGA 5 [Strongly agree] 4 [G6/7] Positive
10 100010 ORGA 4 [Agree] 4 [G6/7] Positive
# … with 1,050 more rows
This is fine, but you will be creating a lot of intermediate objects to get to the final data frame that you want. This clutters up your environment and can fill up your computer’s memory if the data are large enough. You’re in danger of accidentally referring to the wrong object if you don’t name them well.
Instead, you could create one object that is built by chaining all the functions together in order.
We’ll use a special pipe operator – %>% – that reads as ‘take what’s on the left of the operator and pass it through to the next function’. In pseudocode:
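Something like:

```r
# Pseudocode – these aren't real functions, so don't run this
new_object <- original_data %>% # take the data, then...
  do_this() %>%                 # ...apply a first step, then...
  do_that()                     # ...apply a second step
```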
A real example with our data might look like this:
data_piped <- data %>%
select(ResponseID, OverallDeptCode, B01, Z02) %>%
filter(OverallDeptCode == "ORGA" & Z02 != 5) %>%
mutate(positive = ifelse(B01 %in% c(4, 5), "Positive", "Not positive"))
print(data_piped)

So the steps for creating the data_piped object are:

1. Start with the data object
2. select() the columns of interest
3. filter() to keep only the rows that meet our conditions
4. mutate() to add the new ‘positive’ column

This is a bit like a recipe. And it’s easier to read.
You also repeat yourself less: we only need to name the data object once, at the very start. This minimises the chance that you’ll accidentally refer to the wrong object.
So far we’ve been wrangling but not analysing data. Let’s look at the summarise() function for some quick summaries.
A simple example might be to get the total count of responses in the data set and the mean of the engagement scores.
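A sketch of that summary, matching the column names in the output below (n() counts rows; na.rm = TRUE ignores missing engagement scores):

```r
data %>%
  summarise(
    total_count = n(),                            # number of responses
    ees_mean = round(mean(ees, na.rm = TRUE), 2)  # mean engagement score
  )
```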
# A tibble: 1 x 2
total_count ees_mean
<int> <dbl>
1 11555 0.59
That’s good, but we can extend the summary so we get results grouped by some other variables. This is what the group_by() function does. You give group_by() the variables within which to summarise and you finish by calling ungroup() so that the subsequent functions don’t get applied to the groups.
So here’s a more comprehensive example that gets the total count and mean EES grouped within departments and the Z02 variable (grade). It then filters out people who didn’t answer Z02 and uses mutate() to suppress any mean EES values composed of fewer than 10 responses.
data %>%
group_by(OverallDeptCode, Z02) %>%
summarise(
total_count = n(),
ees_mean = round(mean(ees, na.rm = TRUE), 2)
) %>%
ungroup() %>%
filter(!is.na(Z02)) %>%
mutate(
ees_mean_supp = ifelse(
test = total_count < 10, yes = NA, no = ees_mean
)
)
# A tibble: 15 x 5
OverallDeptCode Z02 total_count ees_mean ees_mean_supp
<chr> <dbl+lbl> <int> <dbl> <dbl>
1 ORGA 1 [AO/AA] 4 0.75 NA
2 ORGA 2 [eo] 131 0.7 0.7
3 ORGA 3 [SEO/HEO] 438 0.7 0.7
4 ORGA 4 [G6/7] 487 0.64 0.64
5 ORGA 5 [scs] 78 0.85 0.85
6 ORGB 1 [AO/AA] 4930 0.56 0.56
7 ORGB 2 [eo] 1513 0.580 0.580
8 ORGB 3 [SEO/HEO] 2359 0.64 0.64
9 ORGB 4 [G6/7] 133 0.76 0.76
10 ORGB 5 [scs] 18 0.570 0.570
11 ORGC 1 [AO/AA] 1 1 NA
12 ORGC 2 [eo] 17 0.79 0.79
13 ORGC 3 [SEO/HEO] 24 0.61 0.61
14 ORGC 4 [G6/7] 37 0.6 0.6
15 ORGC 5 [scs] 4 0.78 NA
We could have a whole separate session on visualising data.
The tidyverse package for plotting is called {ggplot2}. The ‘gg’ stands for ‘grammar of graphics’. It’s a system for building up a graphic from common components, such as the data, ‘geoms’ (the shapes that represent the data), scales and themes.
You also supply aesthetic properties like size, colour, x and y locations.
To add these layers you use a +. Imagine you’ve created a blank canvas and you’re building up your image layer by layer. (This is different to using the pipe, %>%, which is passing information from the left-hand side to the right-hand side.)
The great thing about building plots with code is that you can produce them with the same styles very quickly without all the manual adjustments that might be required in some other programs.
For now, let’s look at a simple bar chart of the answers to question B01 using the ggplot() function from {ggplot2}.
# Prepare the data
plot_data <- data %>%
filter(!is.na(B01)) %>% # remove NAs
count(OverallDeptCode, B01) %>% # count() is a shortcut for summarising
mutate(
B01 = haven::as_factor(B01), # add the text labels
Department = OverallDeptCode
)
# Plot the data
ggplot(
plot_data,
aes(x = B01, y = n)) +
geom_col()

What just happened? We:
1. passed our prepared plot_data to ggplot()
2. mapped the aesthetics with aes() (in this case, the x and y variables)
3. added a ‘geom’ layer, geom_col(), to make a bar chart

We can spruce this up a little by adding on additional things like a theme or labels.
ggplot(plot_data, aes(x = B01, y = n)) +
geom_col(aes(fill = Department)) +
coord_flip() + # flip the axes
theme_light() + # apply a theme
scale_fill_brewer(palette = "Blues") + # set the bar colours
labs( # provide overall labels
title = "Most people say they're interested in their work",
subtitle = "This is true across all organisations",
caption = "Source: B01, synthetic CSPS data"
) +
xlab(NULL) + # remove x axis
  ylab("Count of responses") # y axis title
But we could also split each department’s results into a grid of small multiples, or ‘facets’, with facet_grid().
ggplot(plot_data, aes(x = B01, y = n)) +
geom_col() +
coord_flip() +
theme_light() +
labs(
title = "Most people say they're interested in their work",
subtitle = "This is true across all organisations",
caption = "Source: B01, synthetic CSPS data"
) +
xlab(NULL) +
ylab("Count of responses") +
facet_grid(
cols = vars(OverallDeptCode), # one column per department
scales = "free" # scales are relative to the facet
  )
We can also use {ggplot2} to recreate the style of bar charts used in the PDF reports of People Survey results. First we need to filter the data down to ORG B, reshape it into a plottable format, and calculate percentages.
plot_data2 <- data %>%
filter(OverallDeptCode == "ORGB") %>%
select(B47:B51) %>%
mutate_all(haven::as_factor) %>% # convert the variables to factors
tidyr::pivot_longer( # turn the data into 'long' format
cols = everything(), # using all the columns
names_to = "question", # assign names to variable 'question'
values_to = "value" # assign values to variable 'value'
) %>%
tidyr::drop_na(value) %>% # drops any missing responses
count( # count the combinations of:
question, # question, and
value, # value, and
name = "response_count") %>% # give the count a specific name
add_count(
question, # add an extra count by question
wt = response_count, # summing the 'wt' variable
name = "question_count") %>% # give it a specific name
mutate(
pc = response_count/question_count, # calculate responses as % of question
value = forcats::fct_rev(value), # character strings are often better as
question = forcats::fct_rev( # factors when plotting, but sometimes
forcats::as_factor(question) # you need to reverse their 'order'
)
)
print(plot_data2)
# A tibble: 25 x 5
question value response_count question_count pc
<fct> <fct> <int> <int> <dbl>
1 B47 Strongly disagree 455 10158 0.0448
2 B47 Disagree 904 10158 0.0890
3 B47 Neither agree nor disagree 2397 10158 0.236
4 B47 Agree 4410 10158 0.434
5 B47 Strongly agree 1992 10158 0.196
6 B48 Strongly disagree 1197 10152 0.118
7 B48 Disagree 1995 10152 0.197
8 B48 Neither agree nor disagree 3006 10152 0.296
9 B48 Agree 2950 10152 0.291
10 B48 Strongly agree 1004 10152 0.0989
# … with 15 more rows
We now have a dataset with the count of responses for each question-value pair (response_count), the total number of responses for each question (question_count) and each response as a percentage of its question’s total (pc), for questions B47-B51 for respondents in ORGB.
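As a quick sanity check on those percentages, pc is just each response count divided by its question’s total. Using the B47 counts printed above:

```r
# Counts for B47 from the printed tibble above
response_count <- c(455, 904, 2397, 4410, 1992)
question_count <- sum(response_count)  # 10158, matching question_count above
pc <- response_count / question_count
round(pc, 3)  # matches the pc column, to 3 decimal places
```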
We can now plot this data. Rather than Department, we’ll be plotting the questions on the “x-axis” and our calculated percentage on the “y-axis” (we’ll actually flip these axes, but that’s one of the last things we do, so it’s best to keep thinking of them in their original x-y positions). We can also add data labels using geom_text(). The PDF survey reports use a colourblind-friendly pink-green scale from the {RColorBrewer} package, which provides the palettes developed by the Color Brewer project. Finally, we apply some customisation to the theme to remove the axis titles, reposition the legend, give the legend keys an outline, and format the title text.
ggplot(plot_data2, aes(x = question, y = pc)) +
geom_col(aes(fill = value), width = 0.75, colour = "gray60", size = 0.2) +
geom_text(
aes(
label = scales::percent(pc, accuracy = 1),
colour = value),
position = position_fill(vjust = 0.5),
size = 3,
show.legend = FALSE) +
# geom_text adds text labels, we set the label aesthetic to the text
# we've also mapped the colour aesthetic to vary the label text's colour
# text positioning can be tricky, this is why the value factor was reversed
# when we created plot_data2 ¯\_(ツ)_/¯
scale_y_reverse() +
# reverse the y-axis so that strongly agree will be on the left-hand side
scale_fill_brewer(palette = "PiYG", direction = -1) +
# the PiYG palette is the same as is used in the highlights reports
# it is colourblind friendly, so recommended instead of basic red-green
scale_colour_manual(
values = c("Strongly agree" = "white",
"Agree" = "gray20",
"Neither agree nor disagree" = "gray20",
"Disagree" = "gray20",
"Strongly disagree" = "white")) +
# this provides the colours for the text labels, so that the labels for the
# 'strongly' values have white text, and the others have grey text
coord_flip() +
# flip the axis
labs(
title = "Employee engagement question results",
subtitle = "Almost two-thirds of staff are proud to work for ORG B",
    caption = "Source: B47-B51, ORGB, synthetic CSPS data") +
theme_light() +
theme(
panel.grid = element_blank(),
# element_blank() removes an element from the plot
panel.border = element_blank(),
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.title.y = element_blank(),
axis.ticks = element_blank(),
legend.position = "top",
legend.title = element_blank(),
legend.key.size = unit(1, "char"),
legend.margin = margin(1,0,0,0, "char"),
    plot.title = element_text(face = "bold"))
{ggplot2} is a very powerful graphics package that can create all sorts of charts. Check out the R Graph Gallery for some more examples.
You can also search for questions tagged r on StackOverflow, or even ask your own question.
3.1 Comments
In an R script, any characters prefixed with a hash (#) will be recognised as a comment. R will ignore these when you run your code.
Comments are really helpful for letting people understand what your code is doing. Try to keep a narrative going throughout your code to explain what it’s doing. Be explicit – it might be obvious to you right now why a certain line of code is being written, but you might come back in a few months’ time and forget.
It’s also good to use comments to explain what each block of code is doing and to explain particular lines of code. Here’s an example of comments being used for some dummy code:
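For instance, a commented snippet might look like this (the data and object names here are made up for illustration):

```r
# Create some dummy data: three respondents, one with a missing score
dummy_data <- data.frame(id = 1:3, ees_mean = c(0.7, NA, 0.8))

# Keep only respondents with a non-missing engagement score
# (filtering first means later steps run on less data)
complete_scores <- subset(dummy_data, !is.na(ees_mean))

nrow(complete_scores)  # how many rows are left after filtering?
```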
It’s also good to add the title, your name, date, etc, as comments at the top of your script so people know what the script is for when they open it.
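A hypothetical script header might look like this (all of the details are placeholders):

```r
# Title:   Engagement scores by grade
# Author:  A. Analyst
# Date:    2024-01-01
# Purpose: Calculate mean engagement scores from the synthetic CSPS data
#          and suppress groups with fewer than 10 responses
```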